Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation

Authors

  • Ruiqiang Zhang
  • Eiichiro Sumita
Abstract

Data sparseness is one of the factors that degrade statistical machine translation (SMT). Existing work has shown that using morphosyntactic information is an effective remedy for data sparseness. However, less effort has been devoted to applying English morphosyntactic analysis in Chinese-to-English SMT. We found that, although English is a language with relatively little inflection, using English lemmas in training can significantly improve the quality of word alignment, which in turn yields better translation performance. We carried out comprehensive experiments on training sets of various sizes to demonstrate this. We also propose a new, effective linear interpolation method to integrate multiple homologous features of translation models.
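
The abstract does not spell out the interpolation formula, so the following is only a minimal sketch of generic linear interpolation between two homologous translation-model features, not necessarily the authors' method. It assumes two hypothetical phrase tables, one trained on surface forms and one trained on lemmas, each mapping a (source, target) pair to a probability; the names `interpolate`, `surface_table`, `lemma_table`, and the weight `lam` are illustrative assumptions, as are the toy probabilities.

```python
from typing import Dict, Tuple

# A phrase table here is a hypothetical mapping (source, target) -> p(target | source).
PhraseTable = Dict[Tuple[str, str], float]


def interpolate(table_a: PhraseTable, table_b: PhraseTable, lam: float) -> PhraseTable:
    """Linearly interpolate the same feature taken from two models.

    Each merged score is lam * p_a + (1 - lam) * p_b; a pair missing
    from one table contributes probability 0.0 from that table.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    merged: PhraseTable = {}
    for pair in table_a.keys() | table_b.keys():
        p_a = table_a.get(pair, 0.0)
        p_b = table_b.get(pair, 0.0)
        merged[pair] = lam * p_a + (1.0 - lam) * p_b
    return merged


# Toy values, invented purely for illustration.
surface_table = {("mao", "cat"): 0.6, ("mao", "cats"): 0.4}
lemma_table = {("mao", "cat"): 0.9, ("mao", "kitten"): 0.1}

mixed = interpolate(surface_table, lemma_table, lam=0.5)
for pair, prob in sorted(mixed.items()):
    print(pair, round(prob, 3))
```

Because both inputs are distributions over targets for the same source phrase and the weights sum to one, the interpolated scores for that source phrase still sum to one. In practice the weight would be tuned on held-out data rather than fixed at 0.5.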

Similar resources

Enhancing Lemmatization for Mongolian and its Application to Statistical Machine Translation

Lemmatization is crucial in natural language processing and information retrieval especially for highly inflected languages, such as Finnish and Mongolian. The state-of-the-art method of lemmatization for Mongolian does not need a noun dictionary and is scalable, but errors of this method are mainly caused by problems related to part of speech (POS) information. To resolve this problem, we inte...

Minimum error training of log-linear translation models

Recent work on training of log-linear interpolation models for statistical machine translation reported performance improvements by optimizing parameters with respect to translation quality, rather than to likelihood oriented criteria. This work presents an alternative and more direct training procedure for log-linear interpolation models. In addition, we point out the subtle interaction betwee...

NRC Russian-English Machine Translation System for WMT 2016

We describe the statistical machine translation system developed at the National Research Council of Canada (NRC) for the Russian-English news translation task of the First Conference on Machine Translation (WMT 2016). Our submission is a phrase-based SMT system that tackles the morphological complexity of Russian through comprehensive use of lemmatization. The core of our lemmatization strateg...

Applying boosting to statistical machine translation

Boosting is a general method for improving the accuracy of a given learning algorithm under certain restrictions. In this work, AdaBoost, one of the most popular boosting algorithms, is adapted and applied to statistical machine translation. The appropriateness of this technique in this scenario is evaluated on a real translation task. Results from preliminary experiments confirm that statistic...

Applying Pairwise Ranked Optimisation to Improve the Interpolation of Translation Models

In Statistical Machine Translation we often have to combine different sources of parallel training data to build a good system. One way of doing this is to build separate translation models from each data set and linearly interpolate them, and to date the main method for optimising the interpolation weights is to minimise the model perplexity on a heldout set. In this work, rather than optimisi...

Publication date: 2007